Abstract

We propose new methods to speed up convergence of the Alternating Direction 
Method of Multipliers (ADMM), a common optimization tool in the context of 
large scale and distributed learning. The proposed method accelerates the speed 
of convergence by automatically deciding the constraint penalty needed for parameter
consensus in each iteration. In addition, we also propose an extension
of the method that adaptively determines the maximum number of iterations to 
update the penalty. We show that this approach effectively leads to an adaptive, 
dynamic network topology underlying the distributed optimization. The utility of 
the new penalty update schemes is demonstrated on both synthetic and real data, 
including a computer vision application of distributed structure from motion. 


1 Introduction 

The need for algorithms and methods that can handle large data in a distributed setting has grown 
significantly in recent years. Specifically, such settings may arise in two prototypical scenarios: (a) 
induced distributed data: distribute and parallelize computationally demanding optimization tasks to 
connected computational nodes using a data distributed model and (b) intrinsically distributed data: 
data is collected across a connected network of sensors (e.g., mobile devices, camera networks), 
where some or all of the computation can be performed in individual sensor nodes without requiring 
centralized data pooling. Several distributed learning approaches have been proposed to meet these 
needs. In particular, the alternating direction method of multipliers (ADMM) [1] is an optimization
technique that has been widely used in computer vision and machine learning to handle model
estimation and learning in either of the two large data settings.

In the distributed optimization setting, the distributed nodes process data locally by solving small
optimization problems and aggregate the result by exchanging the (possibly compressed) local
solutions (e.g., local model parameter estimates) to arrive at a consensus global result. However, the
nature of distributed learning models, particularly in the fully distributed setting where no network
topology is presumed, inherently requires repetitive communications between the device nodes.
Therefore, it is desirable to reduce the amount of information exchanged and simultaneously
improve computational efficiency through faster convergence of such distributed algorithms.

To this end, the contributions of this paper are threefold.

• We propose two variants of ADMM for consensus-based distributed learning that converge
faster than the standard ADMM. Our method extends an acceleration approach for ADMM [10]
with an efficient variable penalty parameter update strategy. This strategy results in improved
convergence properties of ADMM and also works in a fully distributed fashion.
Figure 1: Centralized (panel a), distributed, and the proposed learning model in a ring network.
A larger $\eta_{ij}$ means that the corresponding constraint is penalized more. Solid edges denote
currently strongly influencing edges and dotted edges indicate the edges with less influence.


• We extend our proposed method to automatically determine the maximum number of
iterations allocated to successive updates by employing a budget management scheme. This
strategy results in adaptive parameter tuning for ADMM, removing the need for arbitrary
parameter settings, and effectively induces a varying network communication topology.

• We apply the proposed method to a prototypical vision and learning problem, the
distributed PPCA for structure-from-motion, and demonstrate its empirical utility over the
traditional ADMM.


2 Problem Description and Related Works 

The problem we consider in this paper can be formulated as a consensus-based optimization
problem. A general consensus-based optimization problem can be written as

$$\arg\min_{\{\theta_i : i = 1..J\}} \sum_{i=1}^{J} f_i(\theta_i) \quad \text{s.t.} \quad \theta_i = \theta_j, \ \forall i \neq j \tag{1}$$

where we want to find the set of optimal parameters $\theta_i$, $i = 1..J$, that minimizes the sum of
convex objective functions $f_i(\theta_i)$, where $J$ denotes the total number of the functions. This problem
is typically a reformulation of a centralized optimization task $\arg\min_{\theta} f(\theta)$ with a decomposable
objective $f(\theta) = \sum_{i=1}^{J} f_i(\theta)$. Given the consensus formulation, the original problem can be solved
by decomposing the problem into $J$ subproblems so that $J$ processors can cooperate to solve the
overall problem by changing the equality constraint to $\theta_i = \theta$, where $\theta$ denotes a globally shared
parameter. The optimization can be approached efficiently by exploiting the alternating direction
method of multipliers (ADMM) [1].

The above consensus formulation is particularly suitable for many optimization problems that appear
in computer vision. For instance, since $f_i(\theta_i)$ can be any convex function, we can also consider a
probabilistic model with the joint negative log likelihood $f_i(\theta_i) = -\log p(x_i, z_i | \theta_i)$ between the
observation $x_i$ and the corresponding latent variable $z_i$. Assuming $(x_i, z_i)$ are independent and
identically distributed, finding the maximum likelihood estimate of the shared parameter $\theta$ can then be
formulated as the optimization problem we described above for many exponential family parametric
densities. Moreover, the function need not be a likelihood, but can also be a typical decomposable
and regularized loss that occurs in many vision problems such as denoising or dictionary learning.

It is often very convenient to consider the above consensus optimization problem from the
perspective of optimization on graphs. For instance, the centralized i.i.d. Maximum Likelihood learning
can be viewed as the optimization on the graph in Fig. 1a. Edges in this graph depict functional
(in)dependencies among variables, commonly found in representations such as Markov Random
Fields [9] or Factor Graphs. In this context, to fully decompose $f(\cdot)$ and eliminate the need for
a processing center completely, one can introduce auxiliary variables $\rho_{ij}$ on every edge to break the
dependency between $\theta_i$ and $\theta_j$, as shown in Fig. 1. This generalizes to arbitrary graphs,
where the connectivity structure may be implied by node placement or communication constraints
(camera networks), imaging constraints (pixel neighborhoods in images or frames in a video
sequence), or other contextual constraints (loss and regularization structure).
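In general, given a connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with the nodes $i, j \in \mathcal{V}$ and the edges
$e_{ij} = (i, j) \in \mathcal{E}$, the consensus optimization problem becomes

$$\min \sum_{i \in \mathcal{V}} f_i(\theta_i) \quad \text{s.t.} \quad \theta_i = \rho_{ij}, \ \rho_{ij} = \theta_j. \tag{2}$$

Solving that problem is equivalent to optimizing the augmented Lagrangian $\mathcal{L}(\Theta) = \sum_{i \in \mathcal{V}} \mathcal{L}_i(\Theta_i)$,

$$\mathcal{L}_i(\Theta_i) = f_i(\theta_i) + \sum_{j \in \mathcal{B}_i} \left\{ \lambda_{ij1}^\top (\theta_i - \rho_{ij}) + \lambda_{ij2}^\top (\rho_{ij} - \theta_j) \right\} + \frac{\eta}{2} \sum_{j \in \mathcal{B}_i} \left\{ \|\theta_i - \rho_{ij}\|^2 + \|\rho_{ij} - \theta_j\|^2 \right\}, \tag{3}$$

where $\Theta = \{\Theta_i : i \in \mathcal{V}\}$, $\Theta_i = \{\theta_i, \{\rho_{ij}\}_{j \in \mathcal{B}_i}, \Lambda_i\}$ are the parameters to find, $\Lambda_i = \{\lambda_{ij1}, \lambda_{ij2} : j \in \mathcal{B}_i\}$,
$\lambda_{ij1}, \lambda_{ij2}$ are Lagrange multipliers, $\mathcal{B}_i = \{j \mid e_{ij} \in \mathcal{E}\}$ is the set of one-hop neighbors of node $i$,
$\eta > 0$ is a fixed scalar penalty parameter, and $\|\cdot\|$ is the induced norm. The ADMM approach suggests
that the optimization can be done in a coordinate descent fashion, taking the gradient with respect to
each variable while fixing all the others.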
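To make the alternating updates concrete, the following is a minimal sketch of fixed-penalty decentralized consensus ADMM on a toy problem. It assumes scalar parameters and the quadratic local objective $f_i(\theta) = \frac{1}{2}\sum_n (x_{in} - \theta)^2$, and it uses the common simplification in which the auxiliary variables $\rho_{ij}$ are eliminated and each node keeps a single aggregated multiplier. The function name `consensus_admm_mean`, the toy data, and the ring layout are illustrative choices, not part of the original method description.

```python
def consensus_admm_mean(data, neighbors, eta=1.0, iters=300):
    """Decentralized consensus ADMM for f_i(theta) = 0.5 * sum_n (x_in - theta)^2.

    data:      dict node -> list of local scalar observations
    neighbors: dict node -> list of one-hop neighbors (symmetric graph)
    eta:       fixed scalar penalty
    """
    nodes = list(data)
    theta = {i: 0.0 for i in nodes}  # local estimates theta_i
    lam = {i: 0.0 for i in nodes}    # one aggregated multiplier per node
    for _ in range(iters):
        new_theta = {}
        for i in nodes:
            # Closed-form minimizer of the local augmented Lagrangian,
            # using the neighbors' estimates from the previous iteration.
            nbr_sum = sum(theta[i] + theta[j] for j in neighbors[i])
            num = sum(data[i]) - 2.0 * lam[i] + eta * nbr_sum
            den = len(data[i]) + 2.0 * eta * len(neighbors[i])
            new_theta[i] = num / den
        theta = new_theta
        for i in nodes:
            # Dual ascent on the accumulated consensus violation.
            lam[i] += 0.5 * eta * sum(theta[i] - theta[j] for j in neighbors[i])
    return theta


# Toy example: four nodes on a ring, each holding a few scalar observations.
data = {0: [1.0, 2.0], 1: [3.0], 2: [10.0, 11.0, 12.0], 3: [5.0]}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
estimates = consensus_admm_mean(data, ring)
global_mean = sum(sum(v) for v in data.values()) / sum(len(v) for v in data.values())
```

In this toy setting every local estimate should approach the mean of all observations, which is the centralized solution, even though no node ever sees the other nodes' data.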

2.1 Convergence Speed of ADMM 

The currently known convergence rate of ADMM is $O(1/T)$, where $T$ is the number of
iterations. Even though $O(1/T)$ is the best known bound, it has been observed empirically that
ADMM converges faster in many applications. Moreover, the computation time per iteration
may dominate the total algorithm running time. Thus many application-specific speed-up techniques
for ADMM have been proposed. One way is to add a predictor-corrector step
to the coordinate descent using an available acceleration method. It guarantees
quadratic convergence for strongly convex $f_i(\cdot)$. Another way is to replace the gradient descent
optimization with a stochastic one. This approach has recently gained attention as it greatly
reduces the computation per iteration. However, these methods usually require a coordinating
center node and thus may not be readily applicable to the decentralized setting. Moreover, we want to
preserve the application range of ADMM and avoid introducing additional assumptions on $f_i(\cdot)$.

One way to improve the convergence speed of ADMM is through the use of a different constraint penalty
in each iteration. For example, [10] proposed ADMM with self-adaptive penalty, which improved
the convergence speed and made the performance less dependent on initial penalty values. The
idea of [10] is to change the constraint penalty taking account of the relative magnitudes of the primal
and dual residuals of ADMM as follows

$$\eta^{t+1} = \begin{cases} \eta^t \cdot (1 + \tau^t), & \text{if } \|r^t\|_2 > \mu \|s^t\|_2 \\ \eta^t \cdot (1 + \tau^t)^{-1}, & \text{if } \|s^t\|_2 > \mu \|r^t\|_2 \\ \eta^t, & \text{otherwise} \end{cases} \tag{4}$$

where $t$ is the iteration index, $\mu > 1$ and $\tau^t > 0$ are parameters, and $r^t$ and $s^t$ are the primal and dual
residuals, respectively.¹ The primal residual measures the violation of the consensus constraints and
the dual residual measures the progress of the optimization in the dual space. This update converges
when $\tau^t$ satisfies $\sum_{t=1}^{\infty} \tau^t < \infty$, i.e., we stop updating $\eta^t$ after a finite number of iterations. Typical
choices for the parameters are $\mu = 10$ and $\tau^t = 1$ at all $t$ iterations. The strength of this
approach is that conservative changes in the penalty are guaranteed to converge [10]. However,
like other ADMM speed-up approaches mentioned above, this update scheme relies on the global
computation of the primal and the dual residuals and requires the $\eta^t$ stored in nodes to be
homogeneous over the entire network, thus it is not a fully decentralized scheme. Moreover, the choice of
parameters as well as the maximum number of iterations requires manual tuning.
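The residual-balancing rule described above can be sketched in a few lines. This is an illustrative sketch only: the function name `residual_balanced_penalty` is hypothetical, and the defaults simply mirror the typical choices $\mu = 10$ and $\tau^t = 1$ quoted in the text.

```python
def residual_balanced_penalty(eta, r_norm, s_norm, mu=10.0, tau=1.0):
    """Return the next penalty given the primal (r) and dual (s) residual norms."""
    if r_norm > mu * s_norm:
        # Consensus is violated much more than the dual is moving: penalize more.
        return eta * (1.0 + tau)
    if s_norm > mu * r_norm:
        # The dual residual dominates: relax the penalty.
        return eta / (1.0 + tau)
    return eta  # residuals are balanced, keep the penalty


increased = residual_balanced_penalty(1.0, r_norm=50.0, s_norm=1.0)
decreased = residual_balanced_penalty(1.0, r_norm=1.0, s_norm=50.0)
unchanged = residual_balanced_penalty(1.0, r_norm=2.0, s_norm=1.0)
```

With $\tau^t = 1$ the penalty is doubled, halved, or left alone, which is why each step changes it by a factor in $\{0.5, 1, 2\}$.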

3 Proposed Methods 

We present our proposed ADMM penalty update schemes in three steps. First, we extend the
aforementioned update scheme of [10] to be applicable to the fully decentralized setting. Next, we propose
a novel penalty parameter update strategy for ADMM speed-up that does not require manual
tuning of $\tau^t$. Finally, we extend the strategy so that we can automatically select the maximum number
of penalty update iterations.

¹ Please refer to [1], pages 18 and 51, for their definitions.
3.1 ADMM with Varying Penalty (ADMM-VP)


Throughout the paper, the superscript $t$ in all terms with subscript $i$ denotes either the objective
function or the parameter at the $t$-th iteration for node $i$. In order to extend the scheme to a fully
distributed setting, we first introduce $\eta_i^t$, the penalty for the $i$-th node at the $t$-th iteration. Next, we
need to compute local primal and dual residuals for each node $i$. In the fully distributed learning
framework, the dual auxiliary variable vanishes from the derivation. However, to compute the residuals,
we need to keep track of the dual variable, which is essentially the average of local estimates, explicitly
over iterations. The squared residual norms for the $i$-th node are defined as

$$\|r_i^t\|_2^2 = \|\theta_i^t - \bar{\theta}_i^t\|_2^2, \qquad \|s_i^t\|_2^2 = (\eta_i^t)^2 \|\bar{\theta}_i^t - \bar{\theta}_i^{t-1}\|_2^2, \qquad \bar{\theta}_i^t = \frac{1}{|\mathcal{B}_i|} \sum_{j \in \mathcal{B}_i} \theta_j^t. \tag{5}$$

Note the difference from the standard residual definitions for consensus ADMM [1], used in [10],
where the dual variable is considered as a single, globally accessible variable $\bar{\theta}^t$ instead of the local
$\bar{\theta}_i^t$. This allows each node to change its $\eta_i^t$ based on its own local residuals. The penalty update
scheme is similar to (4), but $\eta^t$, $\|r^t\|_2$ and $\|s^t\|_2$ are replaced with $\eta_i^t$, $\|r_i^t\|_2$ and $\|s_i^t\|_2$, respectively.
Lastly, [10] stopped changing $\eta^t$ after $t > 50$. However, in ADMM-VP, if we stop the same way,
we end up with heterogeneously fixed penalty values, which impacts the convergence of ADMM by
yielding heavy oscillations near the saddle point. Therefore we reset all penalty values in all nodes
to a pre-defined value (e.g. $\eta^0$, the initial penalty parameter) after a fixed number of iterations. As
we fix the penalty values homogeneously after a finite number of iterations, the method becomes the
standard ADMM after that point, thus the convergence of the ADMM-VP update is guaranteed.
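As a rough illustration of the local residuals, the sketch below computes the neighbor average and the two residual norms for one node. It assumes parameters are plain lists of floats and that the dual residual is scaled by the node's own penalty; the function name `local_residuals` and the toy numbers are hypothetical.

```python
import math


def local_residuals(theta_i, neighbor_thetas, prev_mean, eta_i):
    """Return (primal norm, dual norm, neighbor mean) for a single node.

    theta_i:         current local estimate (list of floats)
    neighbor_thetas: list of the neighbors' current estimates
    prev_mean:       neighbor mean from the previous iteration
    eta_i:           this node's current penalty
    """
    dim = len(theta_i)
    count = len(neighbor_thetas)
    mean = [sum(t[k] for t in neighbor_thetas) / count for k in range(dim)]
    # Primal residual: distance of the local estimate from the neighbor mean.
    r_norm = math.sqrt(sum((theta_i[k] - mean[k]) ** 2 for k in range(dim)))
    # Dual residual: how far the neighbor mean moved, scaled by the penalty.
    s_norm = eta_i * math.sqrt(sum((mean[k] - prev_mean[k]) ** 2 for k in range(dim)))
    return r_norm, s_norm, mean


r_norm, s_norm, mean = local_residuals(
    theta_i=[1.0, 1.0],
    neighbor_thetas=[[3.0, 1.0], [5.0, 1.0]],
    prev_mean=[4.0, 5.0],
    eta_i=2.0,
)
```

Each node only needs its neighbors' current estimates and its own stored mean from the previous iteration, so no global aggregation is required.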


3.2 ADMM with Adaptive Penalty (ADMM-AP) 


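We further extend $\eta_i$ by introducing a bi-directional graph with a penalty constraint parameter $\eta_{ij}$
specific to the directed edge $e_{ij}$ from node $i$ to $j$. The modified augmented Lagrangian $\mathcal{L}_i$ is similar to
(3) except that we replace $\eta$ with $\eta_{ij}$. The penalty constraint controls the amount each constraint
contributes to the local minimization problem. The penalty constraint parameter $\eta_{ij}$ is determined
by evaluating the parameter $\theta_j$ from node $j$ with the objective function $f_i(\cdot)$ of node $i$ as

$$\eta_{ij}^{t+1} = \begin{cases} \eta^0 \cdot (1 + \tau_{ij}^t), & \text{if } t < t^{max} \\ \eta^0, & \text{otherwise} \end{cases} \tag{6}$$

where $t^{max}$ is the maximum number of iterations for the update as proposed in [10] and

$$\tau_{ij}^t = \frac{\kappa_i^t(\theta_i^t)}{\kappa_i^t(\theta_j^t)} - 1, \qquad \kappa_i^t(\theta) = \frac{f_i^t(\theta) - f_i^{min}}{f_i^{max} - f_i^{min}} + 1, \tag{7}$$

$$f_i^{max} = \max\{f_i^t(\theta_i^t), f_i^t(\theta_j^t) : j \in \mathcal{B}_i\}, \qquad f_i^{min} = \min\{f_i^t(\theta_i^t), f_i^t(\theta_j^t) : j \in \mathcal{B}_i\}. \tag{8}$$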
The interpretation of this update strategy is straightforward. In each iteration $t$, each $i$-th node
evaluates its objective using its own estimate $\theta_i^t$ and the estimates from other nodes $\theta_j^t$ (we use
$\rho_{ij}$ instead of the actual $\theta_j^t$ to retain locality of each node from the neighbors). Then, we assign more
weight to the neighbor with the better parameter estimate for the local $f_i(\cdot)$ (i.e. a larger penalty $\eta_{ij}$ if
$f_i(\theta_j) < f_i(\theta_i)$) with the above update scheme. The intuition behind the ADMM-AP update is
to emphasize the local optimization during early stages and then deal with the consensus update at
later, subsequent stages. If all local parameters yield similarly valued local objectives $f_i(\cdot)$, the
onus is placed on consensus. This makes ADMM-AP different from pre-initialization that does the
local optimization using the local observations and ignores the consensus constraints.

Note that unlike the update strategy of [10], we do not need to specify $\tau^t$, and the update weight is
automatically chosen according to the normalized difference in the local objective evaluation among
neighboring parameters. The proposed algorithm also emphasizes the objective minimization over
the minimization that solely depends on the norms of primal and dual residuals of constraints. The
hope is that we not only achieve the consensus of the parameters of the model but also a good
estimate with respect to the objective.
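The objective-driven weighting can be illustrated with a small sketch. It assumes that each edge penalty is the initial penalty scaled by the ratio of normalized objective values, that the normalization maps objective values into $[1, 2]$, and that the penalty falls back to the initial value once the update window has passed. The function name `adaptive_penalties`, the handling of the degenerate case where all objective values coincide, and the toy numbers are assumptions made for illustration.

```python
def adaptive_penalties(f_self, f_neighbors, eta0, t, t_max):
    """Per-edge penalties for one node.

    f_self:      local objective evaluated at the node's own estimate
    f_neighbors: dict neighbor -> local objective evaluated at that neighbor's estimate
    eta0:        initial penalty
    t, t_max:    current iteration and last iteration at which penalties may change
    """
    if t >= t_max:
        return {j: eta0 for j in f_neighbors}
    values = [f_self] + list(f_neighbors.values())
    f_min, f_max = min(values), max(values)
    span = f_max - f_min

    def kappa(f):
        # Normalized objective in [1, 2]; assumed to be 1 when all values coincide.
        return 1.0 if span == 0 else (f - f_min) / span + 1.0

    # eta0 * (1 + tau) with tau = kappa(self) / kappa(neighbor) - 1.
    return {j: eta0 * kappa(f_self) / kappa(f_j) for j, f_j in f_neighbors.items()}


# Neighbor "a" explains the local data better than the node itself, "b" worse.
early = adaptive_penalties(4.0, {"a": 2.0, "b": 6.0}, eta0=10.0, t=0, t_max=50)
late = adaptive_penalties(4.0, {"a": 2.0, "b": 6.0}, eta0=10.0, t=50, t_max=50)
```

Under these assumptions the edge toward the better neighbor is penalized more (pulling the node toward that estimate) and the edge toward the worse neighbor less, while the scaling factor stays within $[0.5, 2]$.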


On the other hand, the convergence property of [10] still holds for the proposed algorithm. Following
Remark 4.2 of [10], the requirement for convergence is that the update ratio be fixed after
some $t^{max} < \infty$ iterations. Moreover, the proposed update keeps the penalty ratio within $[0.5, 2]$,
which matches the increase and decrease amounts suggested in [10]. One may use $t^{max} = 50$
as in [10].


3.3 ADMM with Network Adaptive Penalty (ADMM-NAP) 

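To extend the proposed method so that the maximum number of penalty updates is decided automatically,
the penalty update for the ADMM becomes

$$\eta_{ij}^{t+1} = \begin{cases} \eta^0 \cdot (1 + \tau_{ij}^t), & \text{if } \sum_{u=1}^{t} |\tau_{ij}^u| < T_{ij}^t \\ \eta^0, & \text{otherwise} \end{cases} \tag{9}$$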


Fig. 1 depicts how the proposed model has a different structure from the centralized and traditional
distributed models, and how nodes share their parameters via the network.

In addition to the adaptive penalty update, the inequality condition on the summation of $|\tau_{ij}^u|$, $u = 1..t$,
encodes the spent budget that the edge $e_{ij}$ can use to change $\eta_{ij}$. Each node has its upper bound $T_{ij}^t$ and
every time it makes a change to $\eta_{ij}$, it has to pay exactly the amount it changed. If the edge has
changed too much, too often, the update strategy will block the edge from changing $\eta_{ij}$ any more.

The update scheme is guaranteed to converge if $T_{ij}^t$ is simply set to a constant $T$ for all $i, j, t$
or if $\tau_{ij}^t = 0$ for $t > t^{max}$. However, with a different objective function and different network
connectivity, a different upper bound should be imposed. This is because a given upper bound $T$
or maximum iteration $t^{max}$ could be too small for a certain node to take full advantage of
our adaptation strategy, or it could be too big, so that the method converges much more slowly because
of the continuously changing $\eta_{ij}^t$. To this end, we propose the following update strategy for $T_{ij}^t$:

$$T_{ij}^{t+1} = \begin{cases} T_{ij}^t + \alpha^n T, & \text{if } \sum_{u=1}^{t} |\tau_{ij}^u| \geq T_{ij}^t \text{ and } |f_i(\theta_i^t) - f_i(\theta_i^{t-1})| > \beta \\ T_{ij}^t, & \text{otherwise} \end{cases} \tag{10}$$


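where $T_{ij}^0$ is set by an initial parameter $T$ and $\alpha, \beta \in (0, 1)$ are parameters. Whenever
$\sum_{u=1}^{t} |\tau_{ij}^u| > T_{ij}^t$, we increase $n$ by 1. Once $\sum_{u=1}^{t} |\tau_{ij}^u| \geq T_{ij}^t$ and the objective value is still
significantly changing, i.e. $|f_i(\theta_i^t) - f_i(\theta_i^{t-1})| > \beta$, $T_{ij}^t$ is increased by $\alpha^n T$. Note that the
independent upper bound $T_{ij}^t$ for each $\eta_{ij}$ update on the edge makes it sensitive to the various network
topologies, but it still satisfies the convergence condition because

$$\sum_{n=1}^{\infty} \alpha^n T = \frac{\alpha}{1 - \alpha} T < \infty. \tag{11}$$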

3.4 Combined Update Strategies (ADMM-VP + AP, ADMM-VP + NAP) 


Observing (4) and the proposed update schemes (6) and (9), one can easily come up with a combined
update strategy by replacing $\tau^t$ in (4) with $\tau_{ij}^t$. Based on preliminary experiments, we found that
this replacement yields little utility. Instead, we suggest another penalty update strategy combining
ADMM-VP and ADMM-AP as



( 12 ) 


which we denote as ADMM-VP + AP. We reset the penalty to $\eta^0$ when $t > t^{max}$. In order to combine
ADMM-VP and ADMM-NAP, we consider the summation condition of $\tau_{ij}^t$ as in (9). We denote this
strategy as ADMM-VP + NAP.

4 Distributed Maximum Likelihood Learning 

In this section, we show how our method can be applied to an existing distributed learning framework
in the context of distributed probabilistic principal component analysis (D-PPCA). D-PPCA can be
viewed as a fundamental approach to a general matrix factorization task in the presence of potentially
missing data, with many applications in machine learning.

4.1 Probabilistic Principal Component Analysis 

The Probabilistic PCA (PPCA) has many applications in vision problems, including structure
from motion, dictionary learning, image inpainting, etc. We here restrict our attention to the linear
PPCA without loss of generality. The centralized PPCA is formulated as the task of projecting
the source data $x$ according to $x = Wz + \mu + \epsilon$, where $x \in \mathbb{R}^D$ is the observation column vector,
$z \in \mathbb{R}^M$ is the latent variable following $z \sim \mathcal{N}(0, I)$, $W \in \mathbb{R}^{D \times M}$ is the projection matrix that
maps $x$ to $z$, $\mu \in \mathbb{R}^D$ allows non-zero mean, and $\epsilon \sim \mathcal{N}(0, a^{-1} I)$ is the Gaussian observation noise
with the noise precision $a$. When $a^{-1} = 0$, PPCA recovers the standard PCA. The posterior estimate
of the latent variable $z$ given the observation $x$ is

$$p(z | x) \sim \mathcal{N}\left(\mathbf{M}^{-1} W^\top (x - \mu), \ a^{-1} \mathbf{M}^{-1}\right), \tag{13}$$

where $\mathbf{M} = W^\top W + a^{-1} I$. The parameters $W$, $\mu$ and $a$ can be estimated using a number of
methods, including SVD and the Expectation Maximization (EM) algorithm.
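For intuition, the posterior of the latent variable can be evaluated by hand when the latent space is one-dimensional, because the matrix $W^\top W + a^{-1} I$ then reduces to a scalar. The sketch below covers only that special case; the function name `ppca_posterior_1d` and the toy numbers are illustrative assumptions.

```python
def ppca_posterior_1d(x, w, mu, a):
    """Posterior mean and variance of a scalar latent z given observation x.

    x, w, mu: lists of equal length (observation, projection vector, mean)
    a:        noise precision
    """
    m = sum(wk * wk for wk in w) + 1.0 / a  # scalar version of W^T W + a^{-1} I
    mean = sum(wk * (xk - mk) for wk, xk, mk in zip(w, x, mu)) / m
    var = (1.0 / a) / m
    return mean, var


post_mean, post_var = ppca_posterior_1d(x=[3.0, 3.0], w=[1.0, 1.0], mu=[0.0, 0.0], a=1.0)
```

As the precision $a$ grows, the $a^{-1}$ term vanishes and the posterior mean approaches the ordinary least-squares projection of $x - \mu$ onto $w$.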

4.2 Distributed PPCA 

The distributed extension of PPCA (D-PPCA) [14] can be derived by applying ADMM to the
centralized PPCA model above. Each node learns its local copy of the PPCA parameters with its set of local
observations $X_i = \{x_{in} \mid n = 1..N_i\}$, where $x_{in}$ denotes the $n$-th observation in the $i$-th node and $N_i$
is the number of observations available in the node. Then, the nodes exchange the parameters using the
Lagrange multipliers and impose consensus constraints on the parameters. The global constrained
optimization is

$$\min_{\Theta_i} -\log p(X_i | \Theta_i) \quad \text{s.t.} \quad \Theta_i = \rho_{ij}^{\Theta}, \ \rho_{ij}^{\Theta} = \Theta_j, \tag{14}$$

where $i \in \mathcal{V}$, $j \in \mathcal{B}_i$, $\Theta_i = \{W_i, \mu_i, a_i\}$ is the set of local parameters and $\rho_{ij}^{\Theta} = \{\rho_{ij}^{W}, \rho_{ij}^{\mu}, \rho_{ij}^{a}\}$
is the set of auxiliary variables for the parameters. For the details regarding how the decentralized
model is optimized, see [14].

4.3 D-PPCA with Network Adaptive Penalty

The augmented Lagrangian applying the proposed ADMM with Network Adaptive Penalty is similar
to [14] except that $\eta$ becomes $\eta_{ij}$, with $\lambda_i$, $\gamma_i$, $\beta_i$ the Lagrange multipliers for the PPCA
parameters of node $i$. The adaptive penalty constraint $\eta_{ij}^t$ controls the speed of parameter propagation
dynamically so that the overall optimization empirically converges faster than [14]. One can solve
this optimization using the distributed EM approach. The E-step of the D-PPCA is the same
as its centralized counterpart. The M-step is similar to [14] except we use a separate $\eta_{ij}$ for each
edge. Since the update formulas for the three parameters are similar, we present the $\mu_i$ update as an
example. First, $\mu_i$ can be updated as

$$\mu_i^{t+1} = \left\{ a_i \sum_{n=1}^{N_i} \left( x_{in} - W_i E[z_{in}] \right) - 2\gamma_i^t + \sum_{j \in \mathcal{B}_i} \eta_{ij}^t \left( \mu_i^t + \mu_j^t \right) \right\} \cdot \left( N_i a_i + 2 \sum_{j \in \mathcal{B}_i} \eta_{ij}^t \right)^{-1}, \tag{15}$$

where $E[z_{in}]$ denotes the posterior estimate of the $n$-th latent variable of node $i$. Note that unlike
D-PPCA, where the normalization factor is computed as $N_i a_i + 2\eta |\mathcal{B}_i|$ with $|\cdot|$ the cardinality,
we add up $\eta_{ij}^t$, $\forall j \in \mathcal{B}_i$. The corresponding Lagrange multiplier can be computed as the
penalty-weighted summation of consensus errors, $\gamma_i^{t+1} = \gamma_i^t + (1/2) \sum_{j \in \mathcal{B}_i} \eta_{ij}^t (\mu_i^{t+1} - \mu_j^{t+1})$. Once
all the parameters and the Lagrange multipliers are updated, we update $\eta_{ij}$ and $T_{ij}$ using (9) and
(10), respectively. Algorithm 1 in the appendix summarizes the overall steps for the D-PPCA with
Network Adaptive Penalty.
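The mean update with per-edge penalties and the matching multiplier update can be sketched as below. The sketch assumes vectors are plain lists of floats and that the residual sum $\sum_n (x_{in} - W_i E[z_{in}])$ has already been computed in the E-step; the function names `update_mu` and `update_gamma` and the toy values are hypothetical.

```python
def update_mu(a_i, n_i, resid_sum, gamma_i, mu_i, mu_nbrs, eta):
    """Mean update for one node with a separate penalty eta[j] per neighbor j.

    resid_sum: sum over local samples of (x_in - W_i E[z_in]), as a list
    mu_nbrs:   dict neighbor -> that neighbor's current mean (list)
    eta:       dict neighbor -> penalty on the edge toward that neighbor
    """
    den = n_i * a_i + 2.0 * sum(eta[j] for j in mu_nbrs)
    new_mu = []
    for k in range(len(mu_i)):
        num = a_i * resid_sum[k] - 2.0 * gamma_i[k]
        num += sum(eta[j] * (mu_i[k] + mu_nbrs[j][k]) for j in mu_nbrs)
        new_mu.append(num / den)
    return new_mu


def update_gamma(gamma_i, mu_i_new, mu_nbrs_new, eta):
    """Multiplier update: penalty-weighted sum of consensus errors."""
    return [
        g + 0.5 * sum(eta[j] * (mu_i_new[k] - mu_nbrs_new[j][k]) for j in mu_nbrs_new)
        for k, g in enumerate(gamma_i)
    ]


eta = {1: 1.0, 2: 2.0}
mu_new = update_mu(a_i=2.0, n_i=3, resid_sum=[6.0], gamma_i=[1.0],
                   mu_i=[1.0], mu_nbrs={1: [3.0], 2: [5.0]}, eta=eta)
gamma_new = update_gamma([1.0], [2.0], {1: [1.0], 2: [3.0]}, eta)
```

Setting every `eta[j]` to the same value recovers the fixed-penalty normalization $N_i a_i + 2\eta |\mathcal{B}_i|$ mentioned in the text.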

5 Experiments 

We first analyze and compare the proposed methods (ADMM-VP, ADMM-AP, ADMM-NAP,
ADMM-VP + AP, ADMM-VP + NAP) with the baseline method using synthetic data. Next, we
apply our method to a distributed structure from motion problem using two benchmark real world
datasets. For the baseline, we compare with the standard ADMM-based D-PPCA [14], denoted as
ADMM. Unless noted otherwise, we used $\eta^0 = 10$. To assess convergence, we compare the relative
change of (14) to a fixed threshold for the D-PPCA experiments.

Figure 2: The comparison of proposed methods and the baseline ADMM using the subspace angle
error of the projection matrix with (a-c) different graph size and (c-e) different network topology.
Panels: (a) 12 nodes (complete), (b) 16 nodes (complete), (c) 20 nodes (complete), (d) 20 nodes (ring),
(e) 20 nodes (cluster). Legend: ADMM, ADMM-AP, ADMM-NAP, ADMM-VP, ADMM-VP + AP,
ADMM-VP + NAP.


5.1 Synthetic Data 

We generated 500 samples of 20 dimensional observations from a 5-dim subspace following
$\mathcal{N}(0, I)$, with the Gaussian measurement noise following $\mathcal{N}(0, 0.2 \cdot I)$. For the distributed
settings, the samples are assigned to each node evenly. All experiments were run with 20 independent
random initializations. We measured the number of iterations to convergence and the maximum
subspace angle error versus the ground truth, defined as the maximum of the subspace angles between
each node's projection matrix and the ground truth projection matrix. We examined the impact of
different graph topologies and different graph sizes. We tested three network topologies: complete,
ring and cluster (a connected graph consisting of two complete graphs linked with an edge). For the
graph size, we tested on 12, 16 and 20 node settings.
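The three topologies can be built as neighbor sets in a few lines. This is an illustrative sketch: the function names are hypothetical, and for the cluster graph it is an assumption that the two complete halves are of (nearly) equal size and that the bridge joins the last node of the first half to the first node of the second.

```python
def complete_graph(n):
    """Every node is a neighbor of every other node."""
    return {i: {j for j in range(n) if j != i} for i in range(n)}


def ring_graph(n):
    """Each node is connected to its predecessor and successor."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def cluster_graph(n):
    """Two complete graphs on the two halves of the nodes, joined by one edge."""
    half = n // 2
    nbrs = {}
    for block in (range(0, half), range(half, n)):
        for i in block:
            nbrs[i] = {j for j in block if j != i}
    nbrs[half - 1].add(half)  # the single bridging edge
    nbrs[half].add(half - 1)
    return nbrs


def edge_count(nbrs):
    """Number of undirected edges in a symmetric neighbor map."""
    return sum(len(v) for v in nbrs.values()) // 2


sizes = {
    "complete-12": edge_count(complete_graph(12)),
    "ring-20": edge_count(ring_graph(20)),
    "cluster-20": edge_count(cluster_graph(20)),
}
```

The edge counts give a sense of how much communication each topology implies per iteration: the 20-node ring has far fewer links than the 20-node cluster graph, which in turn has fewer than a 20-node complete graph.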

The top three plots in Fig. 2 depict results over a varying number of nodes while fixing the graph
topology as the complete graph. We plot the median result out of the 20 independent initializations. We
observed that the speed-up with the proposed method, particularly for ADMM-VP and its variants,
becomes more significant as the number of nodes increases. This suggests the proposed method can
be of particular use as the size of an application problem increases. Fig. 2c to Fig. 2e
show the performance in the context of different network topologies. Our proposed methods
converge faster than or at the same rate as the standard ADMM. The proposed method works most robustly
in the complete graph setting. In other words, as the graph connectivity increases, the convergence
property of the proposed method improves. Note also that ADMM-VP works best in the complete graph
while ADMM-AP / NAP are better than ADMM-VP in weakly connected networks. This makes
sense as ADMM-VP depends on residual computation, and the proposed local residual computation
becomes less accurate than in the complete graph, where the global residual can be computed.


5.2 Distributed Affine Structure from Motion 

We tested the performance of our method on five objects of the Caltech Turntable [22] dataset and on
the Hopkins 155 dataset, as in [14]. The goal here is to jointly estimate the 3D structure of the objects as
well as the camera motion, but in a distributed camera network setting. The input measurement
matrix is of size $2F \times N$, where $F$ denotes the number of frames and $N$ denotes the number of
points. By applying PCA, we can decompose the input into the camera pose and the 3D structure
$E[z_{in}]$. For the detailed experimental setting, refer to [14, 24]. As the performance
measure, we used the maximum subspace angle error versus the centralized SVD-reconstructed
structure. The network setting assumes five cameras on a complete graph.

Figure 3: The comparison of proposed methods and the baseline ADMM using the subspace angle
error of the reconstructed 3D structure with one object in the Caltech dataset (Standing). Results on the
remaining four objects can be found in the appendix. See Fig. 2 for the plot labels.


Fig. 3 shows the result on the Caltech Turntable dataset. First, we compare Fig. 3a and Fig. 3b.
One can see that when the graph is less connected (Fig. 3a), the proposed adaptive penalty method
can boost ADMM-VP, which cannot utilize the full residual information of the fully connected case
(Fig. 3b), as explained in the synthetic data experiments. Next, we compare Fig. 3b and Fig. 3c. The
network topologies are the same (complete) but the $t^{max}$ value required for ADMM-VP, ADMM-AP,
ADMM-VP + AP is different in these two groups of experiments. When $t^{max} = 50$ (Fig. 3b), all
methods can accelerate throughout the iterations. However, when $t^{max} = 5$ (Fig. 3c), the methods
that depend on $t^{max}$ cannot accelerate after 5 iterations, thus showing behavior similar to the baseline
ADMM. On the other hand, ADMM-NAP based methods can accelerate by adaptively modifying
the maximum number of penalty updates. Note that one can choose any small value of $T$ and $T_{ij}^t$ is
increased automatically using (10).


For the Hopkins 155 dataset, we compared methods on 135 objects using the same approach as [14].
For each method considered, we computed the mean number of iterations until convergence. Since
some objects in the dataset are point trajectories of non-rigid structure, it is inevitable for simple
linear models to fail for those objects. Thus we omitted objects that yielded an error of more than 15 degrees
when calculating the mean. For each object, we tested 5 independent random initializations. For ADMM-AP,
ADMM-NAP and ADMM-VP + NAP, we found no significant speed-up over the baseline ADMM.
For ADMM-VP and ADMM-VP + AP, we obtained 40.2% and 31.3% speed-up, respectively, when
using the complete network. In the ring network, the amount of improvement becomes smaller. This small or
absent improvement in speed is mainly due to the fact that the baseline ADMM converges fast enough
(typically < 100 iterations), thus there is little room for the proposed methods to speed up the
optimization. As observed from the synthetic experiments and the Caltech dataset, the acceleration of the
proposed methods occurs at the earlier iterations of the optimization. Thus if one can come up with
a better convergence checking criterion depending on the application, the proposed methods can be
a very viable choice due to their parameter-free nature.


6 Conclusion 


We introduced novel adaptive penalty update methods for ADMM that can be applied to
consensus distributed learning frameworks. Contrary to previous approaches, our adaptive penalty update
methods, ADMM-AP and ADMM-NAP, do not depend on parameters that require manual
tuning. Using both synthetic and real data experiments, we showed the empirical effectiveness of
the methods over the baseline. In addition, we found that the performance of ADMM-VP decreases
with weakly connected graphs, and in those cases, ADMM-AP and ADMM-NAP can be useful.

The proposed methods do leave some room for improvement. For problems where the standard
ADMM can converge fast enough, the proposed methods may show less than significant gains.
A better convergence criterion may help stop the proposed algorithms at earlier iterations (e.g. a
criterion that can stop algorithms to remove long tails such as those in Fig. 2c).
A D-PPCA with Network Adaptive Penalty


Here we summarize the distributed probabilistic principal component analysis (D-PPCA) [14]
algorithm modified to use the proposed network adaptive penalty update scheme (ADMM-NAP). We
follow the notations from the previous sections. The D-PPCA with Network Adaptive Penalty
algorithm is summarized in Algorithm 1.


Algorithm 1 D-PPCA with Network Adaptive Penalty

Require: For every node $i$, randomly initialize $W_i^0$, $\mu_i^0$, $a_i^0$ and set $\lambda_i^0 = 0$, $\gamma_i^0 = 0$, $\beta_i^0 = 0$,
         $\eta_{ij}^0 = \eta^0$ for $j \in \mathcal{B}_i$
for $t = 0, 1, 2, \cdots$ until convergence do
    for all $i \in \mathcal{V}$ do
        Compute $E[z_{in}]$ and $E[z_{in} z_{in}^\top]$
        Compute $W_i^{t+1}$, $\mu_i^{t+1}$, $a_i^{t+1}$
    end for
    for all $i \in \mathcal{V}$ do
        Broadcast $W_i^{t+1}$, $\mu_i^{t+1}$, $a_i^{t+1}$ to $\forall j \in \mathcal{B}_i$
    end for
    for all $i \in \mathcal{V}$ do
        Compute $\lambda_i^{t+1}$, $\gamma_i^{t+1}$, $\beta_i^{t+1}$
    end for
    for all $i \in \mathcal{V}$ do
        Update $\eta_{ij}$ for $j \in \mathcal{B}_i$ via (9)
        Update $T_{ij}$ for $j \in \mathcal{B}_i$ via (10)
    end for
end for


B Results on Caltech Turntable Dataset 

We present example image frames from the Caltech Turntable [22] dataset used in [14]. We compare
the proposed methods, ADMM with Varying Penalty (ADMM-VP), ADMM with Adaptive Penalty
(ADMM-AP) and ADMM with Network Adaptive Penalty (ADMM-NAP), and their combinations
(ADMM-VP + AP, ADMM-VP + NAP) with the standard ADMM-based D-PPCA [14] using the
same experimental setting. Fig. 4 shows an example frame, feature points extracted from the frame
and the centralized SVD-based reconstructed structure we used as ground truth. In the paper, we
showed the results of Standing.

Fig. 5 summarizes the results on the remaining four objects. The findings and analysis explained
in the main paper on the object Standing also apply to these four remaining objects. First, we
compare the top and the middle rows. One can see that when the graph is less connected (ring,
top row), the proposed adaptive penalty method can boost ADMM-VP, which cannot utilize the full
residual information of the fully connected case (complete, middle row), as explained in the synthetic
data experiments.

Second, we compare the middle and the bottom rows. The network topologies are the same
(complete) but the $t^{max}$ value required for ADMM-VP, ADMM-AP, ADMM-VP + AP differs between these
two groups of experiments. When $t^{max} = 50$ (middle row), all methods can accelerate throughout
the iterations. However, when $t^{max} = 5$ (bottom row), the methods that depend on $t^{max}$ cannot
accelerate after 5 iterations and thus show behaviour similar to the baseline ADMM. On the other hand,
ADMM-NAP based methods can accelerate by adaptively modifying the maximum number of
penalty updates.
Figure 4: The Caltech Turntable dataset objects used in [14] and the centralized SVD-based affine
structure from motion result. Green dots on the image frame show the feature points tracked. All
objects were tracked for 30 frames. The frames are distributed evenly to the 5 cameras. Panels include
(c) Rooster (189 points) and (e) StorageBin (102 points).


Figure 5: The comparison of proposed methods and the baseline ADMM using the subspace angle
error of the reconstructed 3D structure with different objects in the Caltech dataset: (top) $t^{max} = 50$,
ring, (middle) $t^{max} = 50$, complete, (bottom) $t^{max} = 5$, complete network. Panels: (a) BallSander
(62 points), (b) BoxStuff (67 points), (c) Rooster (189 points), (d) StorageBin (102 points). Refer to
Fig. 2 in the main paper for the labels.